Global Socioeconomic Data Analysis: GDP & Population

Web Scraping, Cleaning, Analysis & Visualization

Introduction

This project analyzes global socioeconomic indicators, focusing on GDP and population trends by country and continent. The goal is to explore relationships between population size, economic performance (GDP), and geographic distribution.

Data was scraped from Wikipedia, cleaned, merged, and visualized using Python.

Technologies Used:

  • Python (pandas, BeautifulSoup, requests)

  • Data Visualization (Matplotlib, Seaborn, Plotly)

  • Data Wrangling (pandas)

Objectives

  • Scrape GDP & Population datasets from Wikipedia.

  • Clean and align the datasets for analysis.

  • Visualize population and GDP trends across continents and countries.

  • Identify socioeconomic disparities.

  • Demonstrate an end-to-end data workflow.

Data Sources

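Dataset | Source | URL
GDP by Country | Wikipedia: GDP (Nominal) | https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)
Population by Country | Wikipedia: UN Population Estimate | https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)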

Web Scraping Process

BeautifulSoup is a Python library used to parse and extract data from HTML or XML files.

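pandas.read_html() parses every <table> element on a webpage and returns them as a list of DataFrames. The table of interest is then picked by positional index, or the list is narrowed up front with the match argument (a text pattern the table must contain) or the attrs argument (HTML attributes such as a class name).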
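The three ways of selecting a table can be sketched on a small inline page. The HTML below is a hypothetical stand-in for a real Wikipedia page, and the example assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second carries class="wikitable".
html = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>Navigation box</td></tr>
</table>
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>India</td><td>1438069596</td></tr>
  <tr><td>China</td><td>1422584933</td></tr>
</table>
"""

# Positional index: parse every table, then pick one by position.
all_tables = pd.read_html(StringIO(html))
by_position = all_tables[1]

# attrs: keep only tables whose HTML attributes match.
by_attrs = pd.read_html(StringIO(html), attrs={"class": "wikitable"})[0]

# match: keep only tables containing text that matches the pattern.
by_match = pd.read_html(StringIO(html), match="Population")[0]

print(by_attrs)
```

Selecting by match or attrs tends to survive page edits better than a hard-coded position, since a table's index changes whenever another table is added above it.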

Import necessary libraries

In [208]:
import requests  # Used to send HTTP requests and fetch content from websites

from bs4 import BeautifulSoup  # Parses HTML/XML content to extract and navigate elements for web scraping

import pandas as pd  # Handles tabular data, allows you to store scraped tables in DataFrames and manipulate them

import matplotlib.pyplot as plt  # Basic plotting library for creating charts and graphs (line, bar, scatter, etc.)

import seaborn as sns  # Enhances matplotlib visuals; great for advanced statistical plots and aesthetic styling

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

Step 1: Scrape GDP Data

In [209]:
# Step 1: Scrape GDP data
url_gdp = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

response = requests.get(url_gdp)
soup = BeautifulSoup(response.content, "lxml")

# Parse every table on the page, then pick the GDP table by position
# (index 2 was found by inspecting the page manually via browser dev tools)
tables = pd.read_html(response.text)
gdp_table = tables[2]  

# View the table
gdp_table.head()
Out[209]:
Country/Territory IMF[1][12] World Bank[13] United Nations[14]
Country/Territory Forecast Year Estimate Year Estimate Year
0 World 113795678 2025 111326370 2024 100834796 2022
1 United States 30507217 2025 29184890 2024 27720700 2023
2 China 19231705 [n 1]2025 18743803 [n 3]2024 17794782 [n 1]2023
3 Germany 4744804 2025 4659929 2024 4525704 2023
4 India 4187017 2025 3912686 2024 3575778 2023

Step 2: Scrape Population Data

In [210]:
# Scrape population data
url_population = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

response = requests.get(url_population)
soup = BeautifulSoup(response.content, "lxml")

# Find population table
tables = pd.read_html(response.text)
population_table = tables[0]

population_table.head()
Out[210]:
Country or territory Population (1 July 2022) Population (1 July 2023) Change (%) UN continental region[1] UN statistical subregion[1]
0 World 8021407192 8091734930 +0.88% – –
1 India 1425423212 1438069596 +0.89% Asia Southern Asia
2 China[a] 1425179569 1422584933 −0.18% Asia Eastern Asia
3 United States 341534046 343477335 +0.57% Americas Northern America
4 Indonesia 278830529 281190067 +0.85% Asia South-eastern Asia

Data Cleaning & Preparation

Cleaning GDP Data:

  • Rename columns for clarity
  • Remove non-numeric artifacts (e.g., footnotes [n 1])
  • Convert GDP estimates to Int64
  • Convert years to datetime
  • Drop missing values for consistency
In [211]:
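# Restructure the columns
# Note: the Wikipedia table lists GDP in millions of USD, so the "Billion"
# in the IMF column name is a misnomer; the name is kept because later cells use it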
gdp_table.columns = [
    "Country",
    "IMF_GDP_Billion_USD", "IMF_Year",
    "WorldBank_Estimate", "WorldBank_Year",
    "UN_Estimate", "UN_Year"
]

# Drop the first row (the "World" aggregate)
gdp_table = gdp_table.drop(index=0)

display(gdp_table.head())
Country IMF_GDP_Billion_USD IMF_Year WorldBank_Estimate WorldBank_Year UN_Estimate UN_Year
1 United States 30507217 2025 29184890 2024 27720700 2023
2 China 19231705 [n 1]2025 18743803 [n 3]2024 17794782 [n 1]2023
3 Germany 4744804 2025 4659929 2024 4525704 2023
4 India 4187017 2025 3912686 2024 3575778 2023
5 Japan 4186431 2025 4026211 2024 4204495 2023
In [212]:
# List column names
gdp_table.columns
Out[212]:
Index(['Country', 'IMF_GDP_Billion_USD', 'IMF_Year', 'WorldBank_Estimate',
       'WorldBank_Year', 'UN_Estimate', 'UN_Year'],
      dtype='object')
In [213]:
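# Remove footnote artifacts (e.g., [n 1])
footnote_cols = ['IMF_GDP_Billion_USD', 'IMF_Year', 'WorldBank_Estimate',
       'WorldBank_Year', 'UN_Estimate', 'UN_Year']

# Caution: r"(\d{4})" keeps only the first run of four digits. That recovers
# the year from "[n 1]2025", but it also truncates the GDP estimates
# (30507217 becomes 3050), as the output below shows.
for col in footnote_cols:
    gdp_table[col] = gdp_table[col].astype(str).str.extract(r"(\d{4})")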

display(gdp_table.head())
Country IMF_GDP_Billion_USD IMF_Year WorldBank_Estimate WorldBank_Year UN_Estimate UN_Year
1 United States 3050 2025 2918 2024 2772 2023
2 China 1923 2025 1874 2024 1779 2023
3 Germany 4744 2025 4659 2024 4525 2023
4 India 4187 2025 3912 2024 3575 2023
5 Japan 4186 2025 4026 2024 4204 2023
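The (\d{4}) pattern used above keeps only the first four-digit run, which is why 30507217 appears as 3050 in the estimate columns. A minimal sketch of an alternative, assuming footnote markers always sit inside square brackets: delete the bracketed markers and leave every remaining digit in place. The helper name strip_footnotes and the column name IMF_Estimate are hypothetical; the two sample rows are copied from the scraped table.

```python
import pandas as pd

# Two rows copied from the scraped GDP table, stored as strings the way
# they look before cleaning.
sample = pd.DataFrame({
    "Country": ["United States", "China"],
    "IMF_Estimate": ["30507217", "19231705"],
    "IMF_Year": ["2025", "[n 1]2025"],
})


def strip_footnotes(series: pd.Series) -> pd.Series:
    """Remove bracketed markers such as "[n 1]" and surrounding whitespace."""
    return (
        series.astype(str)
        .str.replace(r"\[[^\]]*\]", "", regex=True)
        .str.strip()
    )


for col in ["IMF_Estimate", "IMF_Year"]:
    cleaned = strip_footnotes(sample[col])
    # Anything that is still not a number becomes <NA> instead of raising.
    sample[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")

print(sample)
```

Because the digits are never sliced, the estimates keep their full magnitude and the years still come out as four-digit integers.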
In [214]:
# Display the structure of the columns
gdp_table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 1 to 221
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Country              221 non-null    object
 1   IMF_GDP_Billion_USD  180 non-null    object
 2   IMF_Year             189 non-null    object
 3   WorldBank_Estimate   198 non-null    object
 4   WorldBank_Year       209 non-null    object
 5   UN_Estimate          200 non-null    object
 6   UN_Year              212 non-null    object
dtypes: object(7)
memory usage: 12.2+ KB
In [215]:
# Define columns
numeric_cols = ["IMF_GDP_Billion_USD", "WorldBank_Estimate", "UN_Estimate"]
year_cols = ["IMF_Year", "WorldBank_Year", "UN_Year"]

# Convert numeric estimates to integers
gdp_table[numeric_cols] = gdp_table[numeric_cols].apply(pd.to_numeric, errors="coerce")
gdp_table[numeric_cols] = gdp_table[numeric_cols].astype("Int64") 

# Clean and convert year columns to datetime
for col in year_cols:
    gdp_table[col] = pd.to_datetime(
        gdp_table[col].astype(str).str.extract(r"(\d{4})")[0], format="%Y"
    )

# Display cleaned dataset summary
print(gdp_table.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 1 to 221
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Country              221 non-null    object        
 1   IMF_GDP_Billion_USD  180 non-null    Int64         
 2   IMF_Year             189 non-null    datetime64[ns]
 3   WorldBank_Estimate   198 non-null    Int64         
 4   WorldBank_Year       209 non-null    datetime64[ns]
 5   UN_Estimate          200 non-null    Int64         
 6   UN_Year              212 non-null    datetime64[ns]
dtypes: Int64(3), datetime64[ns](3), object(1)
memory usage: 12.9+ KB
None
In [216]:
# Check NaNs
print(gdp_table.isna().sum())
Country                 0
IMF_GDP_Billion_USD    41
IMF_Year               32
WorldBank_Estimate     23
WorldBank_Year         12
UN_Estimate            21
UN_Year                 9
dtype: int64
In [217]:
gdp_data = gdp_table.dropna()

# Check if NaNs are removed
print(gdp_data.isna().sum())
Country                0
IMF_GDP_Billion_USD    0
IMF_Year               0
WorldBank_Estimate     0
WorldBank_Year         0
UN_Estimate            0
UN_Year                0
dtype: int64
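Calling dropna() with no arguments removes every row that is missing any of the six source columns, so at most 180 of the 221 rows (the IMF non-null count) can survive. A hedged sketch of a narrower alternative, assuming only the IMF estimate is needed downstream: pass subset so that gaps in the other sources do not discard a country. The country names A, B, C and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical three-country table with gaps in different sources.
gdp = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "IMF_Estimate": pd.array([100, None, 300], dtype="Int64"),
    "UN_Estimate": pd.array([90, 210, None], dtype="Int64"),
})

# Strict: a row must be complete in every column to be kept.
strict = gdp.dropna()

# Narrow: a row is dropped only when the column actually used is missing.
imf_only = gdp.dropna(subset=["IMF_Estimate"])

print(len(strict), len(imf_only))
```

Which variant fits depends on the analysis; the strict form guarantees complete rows, while the narrow form keeps more countries in the merged dataset.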

Cleaning Population Data:

  • Rename columns for clarity
  • Drop missing values for consistency
In [218]:
# Restructure the Columns
population_table.columns = [
    "Country", "Population_2022_Count", "Population_2023_Count", "Change_Percentage", "Continent","Region"
]

# Drop the first row (the "World" aggregate)
population_data = population_table.drop(index=0)


display(population_data.head())
print(population_data.info())
Country Population_2022_Count Population_2023_Count Change_Percentage Continent Region
1 India 1425423212 1438069596 +0.89% Asia Southern Asia
2 China[a] 1425179569 1422584933 −0.18% Asia Eastern Asia
3 United States 341534046 343477335 +0.57% Americas Northern America
4 Indonesia 278830529 281190067 +0.85% Asia South-eastern Asia
5 Pakistan 243700667 247504495 +1.56% Asia Southern Asia
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 1 to 237
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Country                237 non-null    object
 1   Population_2022_Count  237 non-null    int64 
 2   Population_2023_Count  237 non-null    int64 
 3   Change_Percentage      237 non-null    object
 4   Continent              237 non-null    object
 5   Region                 237 non-null    object
dtypes: int64(2), object(4)
memory usage: 11.2+ KB
None
In [219]:
# Check for missing values
population_data.isnull().sum()
Out[219]:
Country                  0
Population_2022_Count    0
Population_2023_Count    0
Change_Percentage        0
Continent                0
Region                   0
dtype: int64
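Two artifacts remain in the population table after the steps above: country names can carry footnote markers (China[a]), and Change_Percentage is stored as text with a Unicode minus sign. A minimal sketch of cleaning both, assuming markers always appear in square brackets; the new column name Change_Pct is hypothetical.

```python
import pandas as pd

# Two rows copied from the population table shown above.
pop = pd.DataFrame({
    "Country": ["India", "China[a]"],
    "Change_Percentage": ["+0.89%", "\u22120.18%"],  # \u2212 is the Unicode minus
})

# Strip bracketed footnote markers so names match the GDP table.
pop["Country"] = (
    pop["Country"].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
)

# Normalize the minus sign, drop the percent sign, and convert to float.
pop["Change_Pct"] = (
    pop["Change_Percentage"]
    .str.replace("\u2212", "-", regex=False)
    .str.rstrip("%")
    .astype(float)
)

print(pop)
```

Cleaning the country names matters for the merge in the next step, because the join key has to match exactly.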

Step 3: Merge GDP and Population Data

In [220]:
# Merge datasets on Country
df_combined = pd.merge(gdp_data, population_data, on='Country', how='inner')
display(df_combined.head())
Country IMF_GDP_Billion_USD IMF_Year WorldBank_Estimate WorldBank_Year UN_Estimate UN_Year Population_2022_Count Population_2023_Count Change_Percentage Continent Region
0 United States 3050 2025-01-01 2918 2024-01-01 2772 2023-01-01 341534046 343477335 +0.57% Americas Northern America
1 Germany 4744 2025-01-01 4659 2024-01-01 4525 2023-01-01 84086227 84548231 +0.55% Europe Western Europe
2 India 4187 2025-01-01 3912 2024-01-01 3575 2023-01-01 1425423212 1438069596 +0.89% Asia Southern Asia
3 Japan 4186 2025-01-01 4026 2024-01-01 4204 2023-01-01 124997578 124370947 −0.50% Asia Eastern Asia
4 United Kingdom 3839 2025-01-01 3643 2024-01-01 3380 2023-01-01 68179315 68682962 +0.74% Europe Northern Europe
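An inner merge silently drops any country whose name differs between the two tables. China is absent from the output above, which is consistent with the population table spelling it China[a]. A sketch of one way to surface such mismatches, using an outer merge with indicator=True on two rows copied from the scraped tables; the column names GDP and Population are hypothetical.

```python
import pandas as pd

gdp = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP": [30507217, 19231705],
})
pop = pd.DataFrame({
    "Country": ["United States", "China[a]"],
    "Population": [343477335, 1422584933],
})

# indicator=True adds a "_merge" column saying where each row came from:
# "both", "left_only", or "right_only".
check = gdp.merge(pop, on="Country", how="outer", indicator=True)

# Rows found in only one table are the names that need reconciling.
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)
```

Reviewing the unmatched names before switching back to how='inner' makes it clear which countries the analysis is leaving out.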

Step 4: Exploratory Data Analysis (EDA)

Descriptive statistics

  • Mean GDP by continent
  • Styling via Styler.applymap() for visual emphasis (renamed Styler.map() in pandas 2.1)
In [221]:
# Group and sort average GDP
gdp_by_continent = (
    df_combined.groupby('Continent')['IMF_GDP_Billion_USD']
    .mean()
    .sort_values()
    .reset_index()
)

# Rename for clarity
gdp_by_continent.columns = ["Continent", "Avg_IMF_GDP_Billion_USD"]

def highlight_by_value(val):
    if val < 3000:
        return 'background-color: lightcoral'
    elif val < 4000:
        return 'background-color: gold'
    else:
        return 'background-color: lightgreen'

styled_df = gdp_by_continent.style.applymap(highlight_by_value, subset=['Avg_IMF_GDP_Billion_USD']) \
    .format({'Avg_IMF_GDP_Billion_USD': "{:,.2f}"})

styled_df
Out[221]:
  Continent Avg_IMF_GDP_Billion_USD
0 Oceania 2,725.17
1 Africa 3,264.69
2 Americas 3,284.94
3 Asia 3,752.68
4 Europe 4,948.61
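A mean on its own can hide how many countries sit behind it and how skewed they are. As a hedged sketch, named aggregation can report the count and median next to the mean. The four rows are copied from the scraped tables (GDP in millions of USD, before truncation); the column name IMF_GDP is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Americas"],
    "Country": ["India", "Japan", "Germany", "United States"],
    "IMF_GDP": [30507217 if c == "United States" else v
                for c, v in [("India", 4187017), ("Japan", 4186431),
                             ("Germany", 4744804), ("United States", 0)]],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    df.groupby("Continent")["IMF_GDP"]
    .agg(countries="count", mean_gdp="mean", median_gdp="median")
    .sort_values("mean_gdp")
    .reset_index()
)

print(summary)
```

With the full dataset, a large gap between mean_gdp and median_gdp for a continent would point to a few very large economies pulling the average up.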

Step 5: Visualization

Population by Continent

In [222]:
# Aggregate population by continent
pop_by_continent = (
    df_combined.groupby("Continent")["Population_2023_Count"]
    .sum()
    .reset_index()
)

# Format labels with continent and formatted population
labels = [
    f"{row['Continent']} ({row['Population_2023_Count']:,})"
    for _, row in pop_by_continent.iterrows()
]

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(
    pop_by_continent["Population_2023_Count"],
    labels=labels,
    autopct="%1.1f%%",     # Shows percent
    startangle=140,
    colors=plt.cm.Set3.colors
)

ax.set_title("🌍 Population Distribution by Continent (2023)", fontsize=14)
plt.tight_layout()
plt.show()
[Figure: pie chart, Population Distribution by Continent (2023)]

Top 10 Countries by Population (2023)

In [223]:
# Sort by Population 2023
top_population = df_combined.sort_values(by='Population_2023_Count', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Country', y='Population_2023_Count', data=top_population, hue = 'Country',
            palette='Blues_d')

plt.title('Top 10 Countries by Population (2023)')
plt.ylabel('Population (Billions)')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
[Figure: bar chart, Top 10 Countries by Population (2023)]

Treemap Showing GDP Distribution by Continent

In [224]:
import plotly.express as px

# 🌳 Create Treemap
fig = px.treemap(
    df_combined,
    path=["Continent"],
    values="IMF_GDP_Billion_USD",
    color="IMF_GDP_Billion_USD",
    color_continuous_scale="Viridis",
    title="IMF GDP Distribution by Continent (in Billions USD)"
)

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()

Top 10 Countries by IMF GDP Estimate (2025)

In [225]:
# Sort by IMF_GDP_Billion_USD
top_gdp = df_combined.sort_values(by='IMF_GDP_Billion_USD', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Country', y='IMF_GDP_Billion_USD', data=top_gdp, hue='Country', palette='Greens_d')

plt.title('Top 10 Countries by IMF GDP Estimate (2025)')
plt.ylabel('GDP (Billion USD)')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
[Figure: bar chart, Top 10 Countries by IMF GDP Estimate (2025)]

Scatter Plot: Population vs GDP

In [226]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_combined, x='Population_2023_Count', y='IMF_GDP_Billion_USD', hue='Continent', s=100)

plt.title('Population (2023) vs GDP (IMF 2025 forecast)')
plt.xlabel('Population (2023)')
plt.ylabel('GDP (Billion USD)')
plt.xscale('log')
plt.yscale('log')
plt.grid(True, which="both", ls="--", linewidth=0.5)
plt.tight_layout()
plt.show()
[Figure: log-log scatter plot, Population vs GDP, colored by continent]

From the scatterplot, countries with larger populations tend to have higher GDP, but the relationship is not linear: population alone doesn't dictate economic output.

Some countries have disproportionately high GDP relative to their population (e.g., United States).

Others with large populations but low GDP (e.g., some African nations) highlight regional disparities.

Asian and African nations are more concentrated in lower GDP per capita ranges.

Americas and Europe show more economic diversity even among similarly populated countries.
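The visual impression from the log-log scatter plot can be backed by a number. A minimal sketch, assuming a Pearson correlation on log10-transformed values is an acceptable summary: the four rows are copied from the scraped tables (GDP in millions of USD, before truncation), so the result only illustrates the calculation and says nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Four rows copied from the scraped tables shown earlier.
sample = pd.DataFrame({
    "Country": ["United States", "Germany", "India", "Japan"],
    "GDP_Million_USD": [30507217, 4744804, 4187017, 4186431],
    "Population_2023": [343477335, 84548231, 1438069596, 124370947],
})

# Log-transform both axes, mirroring plt.xscale('log') and plt.yscale('log').
log_pop = np.log10(sample["Population_2023"])
log_gdp = np.log10(sample["GDP_Million_USD"])

# Pearson correlation between the two log-transformed series.
r = log_pop.corr(log_gdp)
print(f"log-log correlation: {r:.2f}")
```

Run on the full merged table, a value well below 1 would support the observation that population alone does not determine economic output.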

Top 10 with very high GDP but small population

In [227]:
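# Create a GDP per capita column
# Caution: the 1e9 factor assumes GDP in billions of USD. The source table is in
# millions of USD and the estimates were truncated to four digits during cleaning,
# so these values are not true per-capita figures.
df_combined["GDP_per_Capita"] = df_combined["IMF_GDP_Billion_USD"] * 1e9 / df_combined["Population_2023_Count"]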

# Filter Top 10 High GDP / Small Population
top_gdp_small_pop = (
    df_combined[df_combined["Population_2023_Count"] < 40_000_000]
    .sort_values("IMF_GDP_Billion_USD", ascending=False)
    .head(10)
)

# Filter Top 10 High Population / Low GDP
high_pop_low_gdp = (
    df_combined[df_combined["Population_2023_Count"] > 100_000_000]
    .sort_values("IMF_GDP_Billion_USD", ascending=True)
    .head(10)
)
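The filters above rank countries by total GDP within a population band and leave the GDP_per_Capita column unused. A hedged sketch of ranking by GDP per capita instead, assuming untruncated estimates in millions of USD (hence the 1e6 factor): the four rows are copied from the scraped tables, and the column names GDP_Million_USD, Population_2023, and GDP_per_Capita_USD are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["United States", "Germany", "India", "Japan"],
    "GDP_Million_USD": [30507217, 4744804, 4187017, 4186431],
    "Population_2023": [343477335, 84548231, 1438069596, 124370947],
})

# Millions of USD -> USD, then divide by headcount.
df["GDP_per_Capita_USD"] = df["GDP_Million_USD"] * 1e6 / df["Population_2023"]

# Highest output per person first.
ranked = df.sort_values("GDP_per_Capita_USD", ascending=False)
print(ranked[["Country", "GDP_per_Capita_USD"]].round(0))
```

Ranking on a per-person basis is what lets small, wealthy countries stand out, which total GDP cannot show.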
In [228]:
# Bar chart
plt.figure(figsize=(10, 6))
sns.barplot(
    data=top_gdp_small_pop,
    x="IMF_GDP_Billion_USD",
    y="Country",
    hue='Country',
    palette="Greens_r"
)
plt.title("Top 10: High GDP, Small Population")
plt.xlabel("GDP (Billion USD)")
plt.ylabel("Country")
plt.tight_layout()
plt.show()
[Figure: horizontal bar chart, Top 10: High GDP, Small Population]

Top 10 with high population but relatively low GDP

In [229]:
plt.figure(figsize=(10, 6))
sns.barplot(
    data=high_pop_low_gdp,
    x="IMF_GDP_Billion_USD",
    y="Country",
    hue='Country',
    palette="Reds"
)
plt.title("Top 10: High Population, Low GDP")
plt.xlabel("GDP (Billion USD)")
plt.ylabel("Country")
plt.tight_layout()
plt.show()
[Figure: horizontal bar chart, Top 10: High Population, Low GDP]

Key Findings

  • Population size does not always equate to GDP strength
  • Certain small countries rank high on GDP relative to their population (e.g., Luxembourg)
  • Africa and Asia house large populations but varying GDP levels
  • The Americas and Europe show a more even balance between GDP and population

Conclusion

This analysis showcases how web scraping, data wrangling, and visualization can uncover socioeconomic insights from publicly available datasets. It emphasizes the importance of clean data and contextual visualization.